This tutorial is for pyVDJ v0.1.1. The package is available here: https://github.com/veghp/pyVDJ
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc # v1.4.3
import pyvdj
import upsetplot
sc.settings.set_figure_params(dpi=100)
adata = sc.read_h5ad('data/anndata.h5ad')
adata.shape
Our example dataset contains T lymphocytes:
sc.pl.umap(adata, color=['CD8A', 'CCL5', 'GZMH'], use_raw=False)
The Th and the CD8+ Tc are clearly separated. A subcluster of Tc expresses CCL5, granzymes and other effector proteins. We use this dataset to demonstrate the functions of pyVDJ. The dataset consists of three donors, and two conditions (control and gain-of-function).
sc.pl.umap(adata, color=['donor', 'status'], use_raw=False)
To load in the VDJ data, we construct filepaths and a dictionary linking files to sample names. The easiest way is to prepare a manifest that lists the GEX sample names next to the V(D)J 10x directory names in a csv file.
manifest = pd.read_csv('sample_manifest.csv')
manifest = manifest[manifest['metadata'].isin(adata.obs['metadata'].unique())] # keep the ones in adata
#~ manifest = manifest[manifest['VDJ'].notnull()] # can remove samples which have no VDJ (we have VDJ for all GEX)
# Construct paths to VDJ csv files:
paths = 'data/VDJ/' + manifest['VDJ'] + '/filtered_contig_annotations.csv'
samples = dict(zip(paths, manifest['metadata']))
samples
The AnnData object must contain a metadata column (e.g. adata.obs['vdj_obs']) of the following format: cellbarcode + '_' + samplename. This can be constructed from the cell barcodes and sample names (provided that we have sample annotation):
#~ cellnames = adata.obs_names
#~ cellbarcode = cellnames.str.split("-").str[:2].str.join("-") # cell barcode part + '-1'
#~ adata.obs['vdj_obs'] = cellbarcode.astype(str) + "_" + adata.obs['Sample'].astype(str)
We then read V(D)J data into the AnnData object and create annotations. Note that values in the filtered_contig_annotations.csv files cannot be directly added as annotations, because one cell may have 0 to n values per entry. It will be stored in adata.uns and annotation will be generated separately.
adata = pyvdj.load_vdj(samples, adata, obs_col='vdj_obs', cellranger=2)
This loaded 10x V(D)J sequencing data (i.e. filtered_contig_annotations.csv files) into adata.uns['pyvdj']. obs_col specifies the annotation column for cellnames, as prepared above. For Cell Ranger version 3, set the parameter to 3.
For definitions of some words (clone, clonotype etc) used in the next section, see https://github.com/veghp/pyVDJ.
adata = pyvdj.add_obs(adata, obs=['clonotype', 'is_clone', 'is_productive'])
This generates annotation in adata.obs. Now we can plot V(D)J properties:
sc.settings.set_figure_params(dpi=70)
sc.pl.umap(adata, color=['vdj_has_vdjdata', 'vdj_is_clone', 'vdj_is_productive'])
allcell = len(adata.obs['vdj_has_vdjdata'])
vdjcell = sum(adata.obs['vdj_has_vdjdata'])
print('We have %d cells with VDJ data, out of %d cells.' % (vdjcell, allcell))
n_prod = sum(adata.obs['vdj_is_productive'] == 'True')
print('There are %d cells which have only productive chains.' % (n_prod))
This will be extended in the future to report cells which have no productive chain.
The following command adds one metadata column for each type of chain found in the V(D)J data:
adata = pyvdj.add_obs(adata, obs=['chains'])
sc.pl.umap(adata, color=['vdj_chain_TRA', 'vdj_chain_TRG', 'vdj_chain_IGH'])
Most of the cells have alpha / beta chain. This was observed in other 10x datasets too (not shown).
adata = pyvdj.add_obs(adata, obs=['genes']) # one annotation for each gene
sc.pl.umap(adata, color=['vdj_constant_TRBC1', 'vdj_constant_TRDC', 'vdj_constant_TRGC1'])
Similarly, we can add annotation for each V and each J gene:
adata = pyvdj.add_obs(adata, obs=['v_genes', 'j_genes'])
We can flag which cells are members of a clonotype with not fewer than n clones:
adata = pyvdj.add_gt_n(adata, n=40)
sc.pl.umap(adata, color='vdj_clonotype_gt_40')
We can see that the most expanded clonotypes are in the CCL5 (granzyme, etc)-expressing group.
adata = pyvdj.add_obs(adata, obs=['clone_count'])
sc.pl.umap(adata, color='vdj_clone_count')
The plot shows the number of clones in each cell's clonotype, and provides an alternative view of expanded clonotypes. Let's retrieve the CDR3 amino acid sequences of the above clonotypes:
clonotypes = adata.obs['vdj_clonotype_gt_40'].unique()[2:].tolist()
cdr3_dict = pyvdj.get_spec(adata, clonotypes)
cdr3_dict
This can be used to find their specificity in CDR3 databases, such as VDJdb.
Now we obtain statistics for each specified AnnData metadata category:
meta = 'metadata'
#~ adata.obs['metadata'] # one category for each 10x channel
adata = pyvdj.stats(adata, meta)
If ran the first time, this will add a dictionary in adata.uns['pyvdj']['cdr3'], which stores a set of productive CDR3s for each cell:
adata.uns['pyvdj']['cdr3']['cdr3sets']
adata.uns['pyvdj']['cdr3']['cdr3_codes'] # and a dictionary of {code: cdr3 set}
adata.uns['pyvdj']['cdr3']['cdr3_codes_rev'] # and the reverse dictionary of {cdr3set: code}
This was used to calculate the statistics that was added as a dictionary (adata.uns['pyvdj']['stats'][meta]):
adata.uns['pyvdj']['stats'][meta].keys()
See the readme for details on output.
# We define a plotting function. For easy customization, it is not included in the package.
def plot_cells(stats_dict):
hasvdj = stats_dict['cells'][1]
novdj = stats_dict['cells'][0] - hasvdj
bars = np.add(hasvdj, novdj).tolist()
x_axis = list(range(0, len(bars))) # bar positions on x
x_names = list(hasvdj.index)
barWidth = 1
p1 = plt.bar(x_axis, hasvdj, color='#553864', edgecolor='white', width=barWidth)
p2 = plt.bar(x_axis, novdj, bottom=hasvdj, color='#81a75d', edgecolor='white', width=barWidth)
plt.legend((p1[0], p2[0]), ('Has VDJ', 'No VDJ'), ncol=2, bbox_to_anchor=(0, 1), loc='lower left', fontsize='small')
plt.grid(False)
plt.xticks(x_axis, x_names, rotation='vertical')
plt.subplots_adjust(bottom=0.4, left=0.2)
plt.xlabel(stats_dict['meta'])
plt.ylabel('Number of cells')
# Now we can plot cell numbers:
plot_cells(adata.uns['pyvdj']['stats'][meta])
Plot clonotype distribution for each sample:
sc.settings.set_figure_params(dpi=50)
clonotype_dist = adata.uns['pyvdj']['stats'][meta]['clonotype_dist']
for meta_item in adata.obs[meta].unique().tolist():
print(meta_item)
clonotype_dist[meta_item].sort_index().plot('bar', figsize=(7, 5))
plt.subplots_adjust(bottom=0.3, left=0.2)
plt.grid(False)
plt.title(meta_item)
plt.ylabel('Number of cells')
plt.show()
plt.close()
Here, vdj_clone_count = 0 means that the cell did not have VDJ data, and vdj_clone_count = 1 means that the clonotype has only 1 cell (clone).
Apparently, more clonotypes are in the control (normal) sample (C), than in GOF (P1). This could be due to either decreased diversity in GOF, or expansion of a few clonotypes in GOF, which makes it more likely to sample those cells.
In our data, P2 is a non-adult sample, and that could explain the lack of clonotype expansion.
We can plot in a slightly different way, with x-axis being a proper number line:
x_max = max(adata.obs['vdj_clone_count'])
x_max = int(x_max*1.05)
samples_plot = ['c8_C_normal_Untreated_B', 'c3_P1_gof_Untreated_B', 'c6_P2_gof_Untreated_B', ]
sc.settings.set_figure_params(dpi=70)
for s in samples_plot:
df = adata.obs.loc[adata.obs[meta] == s]
x1 = df['vdj_clone_count']
plt.hist(x1, bins=range(0, x_max), width = 2, color='g', label=s)
plt.title(s)
plt.xlabel('vdj_clone_count')
plt.ylabel('Number of cells')
plt.show()
plt.close()
vdjdf = adata.uns['pyvdj']['df']
vdjdf.columns
obs_col = adata.uns['pyvdj']['obs_col'] # this is an optional step: subsets for cells in anndata
vdjdf = vdjdf.loc[vdjdf['barcode_meta'].isin(adata.obs[obs_col])]
for s in vdjdf['sample'].unique():
sd = dict(vdjdf.loc[(vdjdf['sample'] == s)].groupby('clonotype_meta').size())
shannon_index = pyvdj.shannon(sd)
simpson_index = pyvdj.simpson(sd)
print(s)
print('The Shannon index for %s is %s.' % (s, shannon_index))
print('The Simpson index for %s is %s.' % (s, simpson_index))
print()
print()
Note that in the literature the inverse Simpson index (i.e. 1 / index ) is often (erroneously) referred to as the Simpson index.
The above can be expanded to calculate these indices separately for each metadata category:
vdjdf = vdjdf.loc[vdjdf['barcode_meta'].isin(adata.obs[obs_col])]
meta_cat = 'status'
for m in adata.obs[meta_cat].unique():
m_cells = adata.obs.loc[(adata.obs[meta_cat] == m)][obs_col]
vdjdf_m = vdjdf.loc[vdjdf['barcode_meta'].isin(m_cells)]
for s in vdjdf_m['sample'].unique():
sd = dict(vdjdf_m.loc[(vdjdf_m['sample'] == s)].groupby('clonotype_meta').size())
shannon_index = pyvdj.shannon(sd)
simpson_index = pyvdj.simpson(sd)
print(s)
print('The Shannon index for sample %s in category %s is %s.' % (s, m, shannon_index))
print('The Simpson index for sample %s in category %s is %s.' % (s, m, simpson_index))
print()
print()
However, this will still treat each 10x channel as a separate donor, and clonotypes across 10x channels (but from the same donor) will not be combined. (In this context, donor means an individual organism (human or animal).) This is addressed in the next section:
sample_donor = dict(zip(adata.uns['pyvdj']['df']['sample'].unique(), ['P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'C', 'C', ]))
sample_donor
adata = pyvdj.find_clones(adata, sample_donor)
Using the CDR3 nucleotide sequences, the above combined clones within the same donor that were sequenced in different 10x channels and thus have different clonotype IDs. See:
adata.uns['pyvdj']['df']['donor_clonotype_meta']
adata.obs['vdj_donor_clonotype']
# Then we can calculate the indices:
for s in adata.obs['donor'].unique():
sd = dict(adata.obs.loc[adata.obs['donor'] == s]['vdj_donor_clonotype'].value_counts())
shannon_index = pyvdj.shannon(sd)
simpson_index = pyvdj.simpson(sd)
print(s)
print('The Shannon index for %s is %s.' % (s, shannon_index))
print('The Simpson index for %s is %s.' % (s, simpson_index))
print()
print()
Let's find public CDR3s! (i.e. CDR3s shared between donors)
There are two ways to approach this:
We can obtain all CDR3 sequences for each condition, then find shared sequences. We provide an example in a section further below.
A much stricter approach is to find CDR3-combinations (clonotypes) shared between conditions. This is more accurate, because the combination of two CDR3 sequences define the specificity of the TCR. One major disadvantage is that it will return very few positives. For this, we calculate statistics grouped by donor (as opposed to sample, i.e. 10x channel):
meta = 'donor'
adata = pyvdj.stats(adata, meta)
cdr3 = adata.uns['pyvdj']['stats'][meta]['cdr3']
set(cdr3['C']) & set(cdr3['P1']) & set(cdr3['P2'])
As we can see, there are no clonotypes (as defined by the set of CDR3 sequences) shared between all 3 donors.
set(cdr3['P1']) & set(cdr3['P2']) - set(cdr3['C'])
As we can see here and below, this strict approach returned clonotypes with only one CDR3 sequenced. In any case, we can search these amino acid sequences in vdjDB. This one is not present in the database, which may mean it is a sequence potentially private (specific) to this condition. However, these databases are relatively small and capture a small part of the potential diversity of these sequences.
set(cdr3['C']) & set(cdr3['P1'])
'CASSLYNEQFF' is a CMV-specific sequence according to vdjDB. Cytomegalovirus (CMV) is a herpes virus that infects the majority of humans, therefore it is not surprising that a receptor against it was found in the two adult samples.
'CASSSTSGAYNEQFF' is not found in the database, but variants of it (with 1 aa difference) are specific to herpes viruses (CMV, EBV) and influenzA.
These searches will be more useful as these databases are extended with new data from 10x VDJ sequencing experiments. Note that this kind of analysis is complicated by that a TCR likely recognises many different antigens.
Perhaps the most useful part of the data is the GOF-specific CDR3s (that are not found in control):
len(set(cdr3['P1']) - set(cdr3['C']))
A simple visualization of the above:
cdr3_codes_rev = adata.uns['pyvdj']['cdr3']['cdr3_codes_rev']
cdr3_coded = {}
for key, value in cdr3.items():
print(key)
cdr3_coded[key] = [cdr3_codes_rev[cdr3] for cdr3 in value]
cdr3_coded.keys()
We create a multi-indexed table that can be used for the plotting function:
contents = upsetplot.from_contents(cdr3_coded)
contents
upsetplot.UpSet(contents, subset_size='count')
See this for help on reading upsetplots.
We can use get_spec() with a list of clonotypes for each category, or obtain sequences directly from the data, filtered for cells as shown here:
meta = 'status'
vdjdf = adata.uns['pyvdj']['df']
cdr3_simple_dict = {}
for m in adata.obs[meta].unique():
print(m)
cells = adata.obs.loc[adata.obs[meta] == m][obs_col].tolist()
cdr3_m = vdjdf.loc[vdjdf['barcode_meta'].isin(cells)]['cdr3']
cdr3_simple_dict[m] = set(cdr3_m)
cdr3_simple_dict.keys()
gof_only = cdr3_simple_dict['GOF'] - cdr3_simple_dict['Control']
len(gof_only)
The putative GOF-specific CDR3s can be searched for in databases, as shown above.
public_cdr3 = cdr3_simple_dict['GOF'] & cdr3_simple_dict['Control']
len(public_cdr3)
This returns a higher number of CDR3s shared between GOF and control. These are not specific to GOF.
Unfortunately, a bug in AnnData causes the saved adata.uns['pyvdj']['df'] to turn into a numpy array upon loading. Therefore, it is recommended to save it separately, and load when necessary:
adata.uns['pyvdj']['df'].to_pickle(path='adata.pyvdj.df.pkl')
# And anndata:
#~ adata.write('data/anndata.h5ad')